ABSTRACT Methods for alignment of protein sequences
typically measure similarity by using a substitution matrix with
scores for all possible exchanges of one amino acid with
another. The most widely used matrices are based on the
Dayhoff model of evolutionary rates. Using a different
approach, we have derived substitution matrices from about 2000
blocks of aligned sequence segments characterizing more than
500 groups of related proteins. This led to marked improvements
in alignments and in searches using queries from each of
the groups.



Among the most useful computer-based tools in modern
biology are those that involve sequence alignments of pro- 
teins, since these alignments often provide important insights 
into gene and protein function. There are several different 
types of alignments: global alignments of pairs of proteins 
related by common ancestry throughout their lengths, local 
alignments involving related segments of proteins, multiple 
alignments of members of protein families, and alignments 
made during data base searches to detect homology. In each 
case, competing alignments are evaluated by using a scoring 
scheme for estimating similarity. Although several different 
scoring schemes have been proposed (1-6), the mutation data
matrices of Dayhoff (1, 7-9) are generally considered the
standard and are often the default in alignment and searching
programs. In the Dayhoff model, substitution rates are derived
from alignments of protein sequences that are at least
85% identical. However, the most common task involving
substitution matrices is the detection of much more distant 
relationships, which are only inferred from substitution rates 
in the Dayhoff model. Therefore, we wondered whether a
better approach might be to use alignments in which these 
relationships are explicitly represented. An incentive for 
investigating this possibility is that implementation of an 
improved matrix in numerous important applications requires
only trivial effort.

METHODS 

Deriving a Frequency Table from a Data Base of Blocks.
Local alignments can be represented as ungapped blocks with
each row a different protein segment and each column an
aligned residue position. Previously, we described an automated
system, PROTOMAT, for obtaining a set of blocks given
a group of related proteins (10). This system was applied to
a catalog of several hundred protein groups, yielding a data
base of >2000 blocks. Consider a single block representing a
conserved region of a protein family. For a new member of
this family, we seek a set of scores for matches and mismatches
that best favors a correct alignment with each of the
other segments in the block relative to an incorrect alignment.
For each column of the block, we first count the
number of matches and mismatches of each type between the
new sequence and every other sequence in the block. For
example, if the residue of the new sequence that aligns with
the first column of the first block is A and the column has 9
A residues and 1 S residue, then there are 9 AA matches and
1 AS mismatch. This procedure is repeated for all columns of
all blocks with the summed results stored in a table. The new
sequence is added to the group. For another new sequence,
the same procedure is followed, summing these numbers with
those already in the table. Notice that successive addition of
each sequence to the group leads to a table consisting of
counts of all possible amino acid pairs in a column. For
example, in the column consisting of 9 A residues and 1 S
residue, there are 8 + 7 + ... + 1 = 36 possible AA pairs, 9
AS or SA pairs, and no SS pairs. Counts of all possible pairs
in each column of each block in the data base are summed.
So, if a block has a width of w amino acids and a depth of s
sequences, it contributes ws(s - 1)/2 amino acid pairs to the
count [(1 × 10 × 9)/2 = 45 in the above example]. The result
of this counting is a frequency table listing the number of
times each of the 20 + 19 + ... + 1 = 210 different amino acid
pairs occurs among the blocks. The table is used to calculate
a matrix representing the odds ratio between these observed
frequencies and those expected by chance.
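
The pair counting for a single column can be sketched in a few lines of
Python. This is only an illustration of the counting rule described
above, not the BLOSUM program itself; the function name
count_column_pairs and the string representation of a column are
assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def count_column_pairs(column):
    """Count unordered amino acid pairs among all residues of one block column.

    `column` is a string with one residue per sequence in the block
    (a hypothetical representation chosen for this sketch).
    """
    counts = Counter()
    for a, b in combinations(column, 2):
        # Sort so that AS and SA are tallied under the same key.
        counts[tuple(sorted((a, b)))] += 1
    return counts


# The worked example from the text: 9 A residues and 1 S residue.
column = "AAAAAAAAAS"
pairs = count_column_pairs(column)
total_pairs = sum(pairs.values())  # equals s(s - 1)/2 for a depth of s
```

For the example column this gives 36 AA pairs, 9 AS pairs, and no SS
pairs, for a total of 45, in agreement with ws(s - 1)/2 for w = 1 and
s = 10.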

Computing a Logarithm of Odds (Lod) Matrix. Let the total
number of amino acid i,j pairs ($1 \le j \le i \le 20$) for each entry
of the frequency table be $f_{ij}$. Then the observed probability of
occurrence for each i,j pair is

$$q_{ij} = f_{ij} \Big/ \sum_{i=1}^{20} \sum_{j=1}^{i} f_{ij}.$$

For the column of 9 A residues and 1 S residue in the example,
where $f_{AA} = 36$ and $f_{AS} = 9$, $q_{AA} = 36/45 = 0.8$ and
$q_{AS} = 9/45 = 0.2$. Next we estimate the expected probability of
occurrence for each i,j pair. It is assumed that the observed
pair frequencies are those of the population. For the example,
36 pairs have A in both positions of the pair and 9 pairs have
A at only one of the two positions, so that the expected
probability of A in a pair is [36 + (9/2)]/45 = 0.9 and that of
S is (9/2)/45 = 0.1. In general, the probability of occurrence
of the ith amino acid in an i,j pair is

$$p_i = q_{ii} + \sum_{j \ne i} q_{ij}/2.$$

The expected probability of occurrence $e_{ij}$ for each i,j pair is
then $p_i p_j$ for $i = j$ and $p_i p_j + p_j p_i = 2 p_i p_j$ for
$i \ne j$. In the example, the expected probability of AA is
0.9 × 0.9 = 0.81, that of AS + SA is 2 × (0.9 × 0.1) = 0.18, and that
of SS is 0.1 × 0.1 = 0.01. An odds ratio matrix is calculated where
each entry is $q_{ij}/e_{ij}$. A lod ratio is then calculated in bit
units as $s_{ij} = \log_2(q_{ij}/e_{ij})$. If the observed frequencies
are as expected, $s_{ij} = 0$; if less than expected, $s_{ij} < 0$; if
more than expected, $s_{ij} > 0$. Lod ratios are multiplied by a
scaling factor of 2 and then rounded to the nearest integer value to
produce BLOSUM (blocks substitution matrix) matrices in half-bit
units, comparable to matrices generated by the PAM (percent
accepted mutation) program (11). For each substitution matrix,
we calculated the average mutual information (12) per
amino acid pair H (also called relative entropy), and the
expected score E in bit units as

$$H = \sum_{i=1}^{20} \sum_{j=1}^{i} q_{ij} \times s_{ij}; \qquad E = \sum_{i=1}^{20} \sum_{j=1}^{i} p_i \times p_j \times s_{ij}.$$
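
As a check on the arithmetic, the lod calculation for the one-column
example can be sketched in Python. This is a minimal illustration under
stated assumptions: the function name lod_from_counts and the
dictionary layout are hypothetical, and pairs with zero observed count
(such as SS here) are simply left out because their log ratio is
undefined.

```python
import math


def lod_from_counts(f):
    """Return (q, p, e, s) from a table f of unordered pair counts.

    f maps sorted (i, j) tuples to counts; q are observed pair
    probabilities, p residue probabilities, e expected pair
    probabilities, and s lod ratios in bits.
    """
    total = sum(f.values())
    q = {pair: n / total for pair, n in f.items()}

    # p_i = q_ii + sum over j != i of q_ij / 2
    p = {}
    for (i, j), qij in q.items():
        if i == j:
            p[i] = p.get(i, 0.0) + qij
        else:
            p[i] = p.get(i, 0.0) + qij / 2
            p[j] = p.get(j, 0.0) + qij / 2

    # e_ij = p_i * p_j for i == j, and 2 * p_i * p_j otherwise
    e = {(i, j): p[i] * p[j] if i == j else 2 * p[i] * p[j] for (i, j) in q}

    # s_ij = log2(q_ij / e_ij); zero-count pairs are skipped
    s = {pair: math.log2(q[pair] / e[pair]) for pair in q if q[pair] > 0}
    return q, p, e, s


# Column of 9 A residues and 1 S residue: 36 AA pairs and 9 AS pairs.
q, p, e, s = lod_from_counts({("A", "A"): 36, ("A", "S"): 9})

# Scale by 2 and round to obtain half-bit integer scores.
half_bits = {pair: round(2 * value) for pair, value in s.items()}
```

With these inputs the sketch reproduces the values in the text: the
probability of A is 0.9, the expected probabilities of AA and AS + SA
are 0.81 and 0.18, and the AS lod ratio is slightly positive because AS
pairs are observed a little more often than expected.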

Clustering Segments Within Blocks. To reduce multiple
contributions to amino acid pair frequencies from the most
closely related members of a family, sequences are clustered
within blocks and each cluster is weighted as a single sequence
in counting pairs (13). This is done by specifying a
clustering percentage in which sequence segments that are
identical for at least that percentage of amino acids are
grouped together. For example, if the percentage is set at
80%, and sequence segment A is identical to sequence
segment B at ≥80% of their aligned positions, then A and B
are clustered and their contributions are averaged in calculating
pair frequencies. If C is identical to either A or B at
≥80% of aligned positions, it is also clustered with them and
the contributions of A, B, and C are averaged, even though
C might not be identical to both A and B at ≥80% of aligned
positions. In the above example, if 8 of the 9 sequences with
A residues in the 9A-1S column are clustered, then the
contribution of this column to the frequency table is equivalent
to that of a 2A-1S column, which contributes 2 AS
pairs. A consequence of clustering is that the contribution of
closely related segments to the frequency table is reduced (or
eliminated when an entire block is clustered, since this is
equivalent to a single sequence in which no substitutions
appear). For example, clustering at 62% reduces the number
of blocks contributing to the table by 25%, with the remainder
contributing 1.25 million pairs (including fractional pairs),
whereas without clustering, >15 million pairs are counted
(Fig. 1). In this way, varying the clustering percentage leads
to a family of matrices. The matrix derived from a data base
of blocks in which sequence segments that are identical at
≥80% of aligned residues are clustered is referred to as
BLOSUM 80, and so forth. The BLOSUM program implements
matrix construction. Frequency tables, matrices, and programs
for UNIX and DOS machines are available over Internet
by anonymous ftp (sparky.fhcrc.org).

Fig. 1. Relationship between percentage clustering and total
amino acid pair counts plotted on a logarithmic scale and relative
entropy.
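
The clustering rule described above, in which C joins a cluster if it
is sufficiently identical to any one member, amounts to single-linkage
clustering. The following Python sketch illustrates that rule only; it
is not the BLOSUM program, and the function names and the union-find
bookkeeping are assumptions made for the example. The averaging of
contributions within a cluster is not shown.

```python
def percent_identity(a, b):
    """Percentage of aligned positions at which two equal-length segments match."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_segments(segments, threshold):
    """Group segment indices by single linkage at >= threshold percent identity."""
    parent = list(range(len(segments)))

    def find(i):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            if percent_identity(segments[i], segments[j]) >= threshold:
                parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for i in range(len(segments)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Toy block: segment 2 is only 60% identical to segment 0, but it is
# 80% identical to segment 1, so all three end up in one cluster.
segments = ["AAAAA", "AAAAS", "AAASS", "SSSSS"]
clusters = cluster_segments(segments, 80)
```

In this toy block the first three segments form one cluster and the
fourth stands alone, so the block would be counted as if it contained
two sequences.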

Constructing Blocks Data Bases. For this work, we began
with versions of the BLOCKS data base constructed by PROTOMAT
(10) from 504 nonredundant groups of proteins catalogued
in Prosite 8.0 (14) keyed to Swiss-Prot 20 (15).
PROTOMAT employs an amino acid substitution matrix at two
distinct phases of block construction (16). The MOTIF program
uses a substitution matrix when individual sequences
are aligned or realigned against sequence segments containing
a candidate motif (16). The MOTOMAT program uses a
substitution matrix when a block is extended to either side of
the motif region and when scoring candidate blocks (10). A
unitary substitution matrix (matches = 1; mismatches = 0)
was used initially, generating 2205 blocks. Next, the BLOSUM
program was applied to this data base of blocks, clustering at
60%, and the resulting matrix was used with PROTOMAT to
construct a second data base consisting of 1961 blocks. The
BLOSUM program was then applied to this second data base,
clustering at 60%. This matrix was used to construct version
5.0 of the BLOCKS data base from 559 groups in Prosite 9.00
keyed to Swiss-Prot 22. The BLOSUM program was applied to
this final data base of 2106 blocks, using a series of clustering
percentages to obtain a family of lod substitution matrices.
This series of matrices is very similar to the series derived
from the second data base. Approximately similar matrices
were also obtained from data bases generated by PROTOMAT
using the PAM 120 matrix, using a matrix with a clustering
percentage of 80%, and using just the odd- or even-numbered
groups (data not shown).

Alignments and Homology Searches. Global multiple alignments
were done using version 3.0 of MULTALIN for DOS
computers (17). To provide a positive matrix, each entry was
increased by 8 (with default gap penalty of 8). Version 1.6b2
of Pearson's RDF2 program (18) was used to evaluate local
pairwise alignments.

Homology searches were done on a Sun Sparcstation using
the BLASTP version of BLAST dated 3/18/91 (11) and version
1.6b2 of FASTA (with ktup = 1 and -o options) and SSEARCH,
an implementation of the Smith-Waterman algorithm (18-20).
The Swiss-Prot 20 data bank (15) containing 22,654
protein sequences was searched, and one search was done
with each matrix for each of the 504 groups of proteins from
Prosite 8.0. The first of the longest and most distant sequences
in the group was chosen as a searching query,
inferring distance from PROTOMAT results and Swiss-Prot
names.

In the BLOSUM matrices, the scores for B and Z were made
identical to those for D and E, respectively, and -1 was used
for the character X. We used the same gap penalties for all
matrices, -12 for the first residue in a gap, and -4 for
subsequent residues in a gap.
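
The gap penalties just described can be written as a simple function of
gap length. The sketch below only restates that arithmetic; the
function name gap_penalty and its parameter names are hypothetical and
do not correspond to options of FASTA or SSEARCH.

```python
def gap_penalty(length, first=-12, subsequent=-4):
    """Total penalty for a gap of `length` residues.

    The first residue in the gap costs `first`; each further residue
    costs `subsequent`. A gap of length 0 carries no penalty.
    """
    if length <= 0:
        return 0
    return first + subsequent * (length - 1)


# A one-residue gap costs -12; a three-residue gap costs -12 - 4 - 4.
one_residue = gap_penalty(1)
three_residues = gap_penalty(3)
```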

The results of each search were analyzed by considering
the sequences used by PROTOMAT to construct blocks for the
protein group as the true positive sequences and all others as
true negatives. BLAST reports the data bank matches up to a
certain level of statistical significance. Therefore, we counted
the number of misses as the number of true positive sequences
not reported. For FASTA and SSEARCH, we followed
the empirical evaluation criteria recommended by Pearson
(19); the number of misses is the number of true positive
scores that ranked below the 99.5th percentile of the true
negative scores.
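
The miss count used for FASTA and SSEARCH can be illustrated with a
short Python sketch. The percentile convention is an assumption: the
text does not say how the 99.5th percentile was computed, so the
nearest-rank definition is used here, and the function name
count_misses is hypothetical.

```python
def count_misses(positive_scores, negative_scores, per_mille=995):
    """Count true positive scores ranking below a percentile of the negatives.

    The cutoff is the nearest-rank percentile of the true negative
    scores (an assumed convention), expressed in parts per thousand so
    that the rank is computed with integer arithmetic only.
    """
    ranked = sorted(negative_scores)
    # Ceiling of per_mille * n / 1000 without floating-point rounding.
    rank = -(-per_mille * len(ranked) // 1000)
    cutoff = ranked[max(rank, 1) - 1]
    return sum(score < cutoff for score in positive_scores)


# Toy data: 1000 true negative scores 0..999, so the cutoff is 994.
negatives = range(1000)
positives = [2000, 990, 994, 1500]
misses = count_misses(positives, negatives)
```

With these toy scores only the true positive scoring 990 falls below
the cutoff, so one miss is counted.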

RESULTS 

Comparison to Dayhoff Matrices. The BLOSUM series derived
from alignments in blocks is fundamentally different
from the Dayhoff PAM series, which derives from the estimation
of mutation rates. Nevertheless, the BLOSUM series
based on percent clustering of aligned segments in blocks can
be compared to the Dayhoff matrices based on PAM using a
measure of average information per residue pair in bit units
called relative entropy (9). Relative entropy is 0 when the
target (or observed) distribution of pair frequencies is the
same as the background (or expected) distribution and increases
as these two distributions become more distinguishable.
Relative entropy was used by Altschul (9) to characterize
the Dayhoff matrices, which show a decrease with
increasing PAM. For the BLOSUM series, relative entropy
increases nearly linearly with increasing clustering percentage
(Fig. 1). Based on relative entropy, the PAM 250 matrix
is comparable to BLOSUM 45 with relative entropy of ≈0.4 bit,
while PAM 120 is comparable to BLOSUM 80 with relative
entropy of ≈1 bit. BLOSUM 62 (Fig. 2 Lower) is intermediate
in both clustering percentage and relative entropy (0.7 bit)
and is comparable to PAM 160. Matrices with comparable
relative entropies also have similar expected scores.

Some consistent differences are seen when PAM 160 is
subtracted from BLOSUM 62 for every matrix entry (Fig. 2
Upper). Compared to PAM 160, BLOSUM 62 is less tolerant to
substitutions involving hydrophilic amino acids, while it is
more tolerant to substitutions involving hydrophobic amino
acids. For rare amino acids, especially cysteine and tryptophan,
BLOSUM 62 is typically more tolerant to mismatches
than is PAM 160.

Performance in Multiple Alignment of Known Structures.
One test of sequence alignment accuracy is to compare the
results obtained to alignments seen in three-dimensional
structures. Lipman et al. (21) applied a simultaneous multiple
alignment program, MSA, to 3 similarly diverged serine proteases
of known three-dimensional structures. They found
that for 161 closely aligned residue positions, 12 residues
were involved in misalignments. We asked how well a
hierarchical multiple alignment program, MULTALIN (17),
performs on the same proteins using different substitution
matrices. Table 1 shows that MULTALIN performs much
worse than MSA using the PAM 120, 160, or 250 matrices,
misaligning residues at 30-31 positions. In comparison, MULTALIN
with a simple +6/-1 matrix (that assigns +6 to
matches and -1 to mismatches) misaligns residues at 34
positions. In the same test using BLOSUM 45, 62, and 80,
MULTALIN misaligned residues at only 6-9 positions. Comparable
numbers were obtained when residues that show
differences in the positions of side chains were excluded.
Therefore, BLOSUM matrices produced accurate global alignments
of these sequences.

Table 1. Performance of substitution matrices in aligning three
serine proteases

                          Residue positions missed*
Matrix      Program       All positions   Side chains
            MSA                12               6
PAM 120     MULTALIN           31              22
PAM 160     MULTALIN           30              22
PAM 250     MULTALIN           30              22
+6/-1       MULTALIN           34              26
BLOSUM 45   MULTALIN            9               5
BLOSUM 62   MULTALIN            6               4
BLOSUM 80   MULTALIN            9               6

*From data of Greer (22), where residues were considered to be
aligned whenever α-carbons occupied comparable positions in
space (All positions column). For a subset (Side chains column),
residues were excluded where there were differences in the
positions of side chains.

Performance in Searching for Homology in Sequence Data
Banks. To determine how BLOSUM matrices perform in data
bank searches, we first tested them on the guanine nucleotide-binding
protein-coupled receptors, a particularly challenging
group that has been used previously to test searching
and alignment programs (10, 18, 23, 24). Three diverse
queries, LSHR$RAT, RTA$RAT, and UL33$HCMVA,
were chosen from among the 114 full-length family members
catalogued in Prosite based on the observation that none
detected either of the others in searches. The number of
misses was averaged in order to assess the overall searching
performance of different matrices for this group. Three
different programs were used: BLAST (11), FASTA (19), and
Smith-Waterman (20). BLAST rapidly determines the best
ungapped alignments in a data bank. FASTA is a heuristic and
Smith-Waterman is a rigorous local alignment program; both
can optimize an alignment by the introduction of gaps.
Several BLOSUM and PAM matrices in the entropy range of
0.15-1.2 were tested.

Results with each of the 3 programs show that all BLOSUM
matrices in the 0.3-0.8 range performed better than the best
PAM matrix, PAM 200 (Fig. 3). In this range, each BLOSUM
matrix missed fewer members than the PAM matrix
with similar relative entropy. Therefore, BLOSUM improved
detection of members of this family regardless of the searching
program used.

Fig. 2. BLOSUM 62 substitution matrix (Lower) and difference matrix (Upper) obtained by subtracting the PAM 160 matrix position by position.
These matrices have identical relative entropies (0.70); the expected value of BLOSUM 62 is -0.52; that for PAM 160 is -0.57.

To determine whether the superiority of BLOSUM matrices
over PAM matrices generalizes to other families, we carried
out similar comparative tests for 504 groups of proteins
catalogued in Prosite 8.0. For BLAST, BLOSUM 62 performed
slightly better overall than BLOSUM 60 or 70, moderately
better than BLOSUM 45, and much better than the best PAM
matrix in this test, PAM 140 (Fig. 4). Specifically, BLOSUM 62
was better than PAM 140 for 90 groups, whereas it was worse
in only 23 other groups. As a baseline for comparison, we
used the simple +6/-1 matrix, which makes no distinction
among matches or mismatches. Compared to +6/-1, BLOSUM 62
performance was better in 157 groups and was worse
in 6 groups. Of the 504 groups tested, only 217 showed
differences in any comparison. Similar results were obtained
for FASTA (data not shown).

Very recently, two updates of the Dayhoff matrices have
appeared (25, 26). Both use automated procedures to cluster
similar sequences present within an entire protein data base
and therefore provide considerably more aligned pairs than
were used by Dayhoff. However, in tests of these matrices
using BLAST on each of the 504 groups, performance was not
noticeably different from that of the Dayhoff PAM 250 matrix,
which these matrices were intended to replace, and was much
worse than that of matrices in the BLOSUM series (Fig. 4).
Compared to BLOSUM 45, which has similar relative entropy to
PAM 250, the matrix of Gonnet et al. (25) was worse in 130
groups and better in only 3 groups, and the matrix of Jones
et al. (26) was worse in 138 groups and better in only 5 groups.
Fig. 3. Searching performance of programs using members of the
guanine nucleotide-binding protein-coupled receptor family as queries
and matrices from the BLOSUM and PAM series scaled in half-bits
(11). Removal of this family from the BLOCKS data base led to a nearly
identical matrix with similar performance. Matrices represented (left
to right) are BLOSUM (BL) 30, 35, 40, 45, 50, 55, 60, 62, 65, 70, 75,
80, 85, and 90 and PAM (P) 400, 310, 250, 220, 200, 160, 150, 140, 120,
110, and 100. The average numbers of true positive Swiss-Prot
entries missed are shown for LSHR$RAT, RTA$RAT, and
UL33$HCMVA versus Swiss-Prot 20. Results using BLAST and
FASTA or SSEARCH (S-W) are not comparable to each other, since
different detection criteria were used for the three programs.





Fig. 4. Searching performance of BLAST using different matrices
from the BLOSUM (BL) series, the PAM (P) series, and two recent
updates of the standard Dayhoff matrix: GCB (25) and JTT (26).
Results are based on searches using queries for each of 504 different
groups. For each pair of numbers below a box representing a matrix,
the first is the number of groups for which BLOSUM 62 missed fewer
sequences than that matrix, and the second is the number of groups
for which BLOSUM 62 missed more. The vertical distance between
each matrix and BLOSUM 62 is proportional to the difference.

Confirmation of a Suspected Relationship Between Transposon
Open Reading Frames. While the tests described above
demonstrate that BLOSUM matrices perform better overall
than PAM matrices, an example indicates the extent to which
this improvement can matter in a real situation. We investigated
a suspected relationship that is biologically attractive
but is somewhat equivocal when examined by objective
criteria. Two groups have noticed a stretch of similarity
between the predicted protein from the Drosophila mauritiana
mariner transposon and that from Caenorhabditis elegans
transposon Tc1 (S. Emmons and J. Heierhorst, personal
communications) (Fig. 5). However, this alignment did not
score highly enough to allow its detection in searches using
various PAM matrices. In contrast, a BLAST search with
BLOSUM 62 using the mariner predicted protein as query
detected this alignment as the best in the data base (data not
shown). An analysis shows nonzero scores taken from the
difference matrix of Fig. 2 assigned to each amino acid pair.
The higher absolute score for BLOSUM 62 compared to PAM
160 (Σ = 35 for BLOSUM 62 > PAM 160 versus Σ = 14 for
BLOSUM 62 < PAM 160) results from many small differences.
When the scores for this alignment were compared to the
scores for alignments between one of the sequences and 1000
shuffles of the other, the score using BLOSUM 62 was 7.6 SD
above the mean. In contrast, the score using PAM 160 was
only 3.0 SD above the mean with similar results for PAM 250
and PAM 120, accounting for the failure to detect this relationship
in previous data base searches.
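
The shuffle comparison can be sketched in Python as follows. This is
only an illustration of the idea of expressing a score in SD units
above the mean of shuffled comparisons; it is not the RDF2 program,
which scores local alignments with a substitution matrix. The function
names and the toy identity score are assumptions made for the example.

```python
import random
import statistics


def shuffle_z_score(seq_a, seq_b, score, n_shuffles=1000, seed=0):
    """Express score(seq_a, seq_b) in SD units above the shuffled mean.

    seq_b is shuffled n_shuffles times, preserving its composition, and
    each shuffle is scored against seq_a with the supplied function.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = score(seq_a, seq_b)
    residues = list(seq_b)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        shuffled_scores.append(score(seq_a, "".join(residues)))
    mean = statistics.mean(shuffled_scores)
    sd = statistics.stdev(shuffled_scores)
    return (observed - mean) / sd


def identity_score(a, b):
    """Toy stand-in for an alignment score: number of identical positions."""
    return sum(x == y for x, y in zip(a, b))


# Two identical toy sequences score far above their shuffled comparisons.
z = shuffle_z_score("ACDEFGHIKLMNPQRSTVWY", "ACDEFGHIKLMNPQRSTVWY",
                    identity_score, n_shuffles=200)
```

Substituting a matrix-based alignment score for identity_score would
give the kind of comparison reported here, with a larger z value
indicating an alignment less likely to arise from composition alone.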

Mariner iFLHDWAPSBTARAVEU3TLE1?LKMEVLPaAAYS^^ 

BL62>P160 23 2 22 1 3 2 3222 2 2

BL62<P160 122 11 12 12 1

Fig. 5. Alignment of D. mauritiana mariner predicted protein
(amino acids 245-295) with C. elegans TcA (amino acids 235-285)
encoded by Tc1. Difference scores taken from Fig. 2 are indicated
just below each alignment position. Using RDF2 with BLOSUM 62 for
1000 shuffles and a window size of 10, this alignment scores 64,
compared to a mean of 31.4 (SD = 4.32) for z = 7.6. With PAM 160,
the score is 43, compared to a mean of 30.1, SD = 4.63, and z = 3.0.
With PAM 250, z = 2.14; with PAM 120, z = 2.98.



Biochemistry: Henikoff and Henikoff

Proc. Natl. Acad. Sci. USA 89 (1992) 10919



DISCUSSION 

We have found that substitution matrices based on amino
acid pairs in blocks of aligned protein segments perform
better in alignments and homology searches than those based
on accepted mutations in closely related groups. Performance
was improved overall in every test we have done,
including multiple alignment (MULTALIN), detection of ungapped
alignments (BLAST), detection of gapped alignments
(FASTA and Smith-Waterman), and determination of the
significance of an alignment (RDF2). The importance of such
improved performance can be profound for weakly scoring
alignments that are not detected in a search or are not trusted.
For example, the alignment between predicted proteins encoded
by mariner and Tc1 transposons improved by more
than 4.5 SD above the mean of comparisons to shuffled
sequences when BLOSUM 62 was used instead of PAM matrices.

There are fundamental differences between our approach
and that of Dayhoff that could account for the superior
performance of BLOSUM matrices in searches and alignments.
Dayhoff estimated mutation rates from substitutions observed
in closely related proteins and extrapolated those
rates to model distant relationships. In our case, frequencies
were obtained directly from relationships represented in the
blocks, regardless of evolutionary distance. Since blocks
were derived primarily from the most highly conserved
regions of proteins, it is possible that many of the differences
between BLOSUM and PAM matrices arise from different
constraints on conserved regions in general. For example,
Dayhoff found asparagine to be the most mutable residue,
whereas, in blocks, asparagine is involved in substitutions at
an average frequency. This could mean that an asparagine
located in a mutable region of a protein is itself highly
mutable, whereas, when it is located in a conserved region,
it shows only an average tendency to be involved in substitutions.

Another difference is the larger and more representative
data set used in this work. The Dayhoff frequency table
included 36 pairs in which no accepted point mutations
occurred. In contrast, the pairs we counted included no fewer
than 2369 occurrences of any particular substitution. Scoring
differences were especially apparent for pairs involving rare
amino acids such as tryptophan and cysteine. Similar findings
were made in the two recent updates of the Dayhoff matrix
(25, 26). However, in these studies, no evidence was presented
that increased data improved performance. Our tests
show that the updated Dayhoff matrices still perform poorly
overall when compared to BLOSUM 62. This suggests that
matrices from aligned segments in blocks, which represent
the most highly conserved regions in proteins, are more
appropriate for searches and alignments than are matrices
derived by extrapolation from mutation rates.

The BLOSUM series depends only on the identity and
composition of groups in Prosite and the accuracy of the
automated PROTOMAT system. While the system itself uses a
substitution matrix, iterative application soon leads to nearly
the same set of scores, even starting with a unitary matrix or
using a representative subset of the groups. Therefore, we do
not expect that these substitution matrices will change
significantly in the future.

The suggestion to make a substitution matrix from a blocks data
base was made by Temple Smith at the 1991 Aspen Center for
Physics workshop. We thank Scott Emmons and Jörg Heierhorst for
independently pointing out the similarity between mariner and Tc1
predicted proteins, Bill Pearson for advice, and Domokos Vermes for
discussions about information theory. This work was supported by
a grant from the National Institutes of Health.
BLOSUM



From Wikipedia, the free encyclopedia 



BLOSUM (BLOcks of Amino Acid SUbstitution Matrix) is a substitution
matrix used for sequence alignment of proteins. BLOSUM are used to
score alignments between evolutionarily divergent protein sequences.
BLOSUM is based on local alignments. BLOSUM was introduced by
Henikoff and Henikoff. They scanned the BLOCKS database for very
conserved regions of protein families (that do not have gaps in the
sequence alignment) and then counted the relative frequencies of
amino acids and their substitution probabilities. Then, they
calculated a log-odds score for each of the 210 possible substitutions of the 20 standard amino acids. All
BLOSUM are based on observed alignments; they are not extrapolated from comparisons of closely related
proteins like the PAM Matrices.

[Figure: The BLOSUM62 matrix]

Several sets of BLOSUM exist using different alignment databases, named with numbers. BLOSUM with high 
numbers are designed for comparing closely related sequences, while BLOSUM with low numbers are
designed for comparing distantly related sequences. For example, BLOSUM80 is used for less divergent
alignments, and BLOSUM45 is used for more divergent alignments. The matrices were created by merging 
(clustering) all sequences that were more similar than a given percentage into one single sequence and then 
comparing those sequences (that were all more divergent than the given percentage value) only; thus reducing 
the contribution of closely related sequences. The percentage used was appended to the name, giving 
BLOSUM80 for example where sequences that were more than 80% identical were clustered. 

Scores within a BLOSUM are log-odds scores that measure, in an alignment, the logarithm for the ratio of the
likelihood of two amino acids appearing with a biological sense and the likelihood of the same amino acids
appearing by chance. The matrices are based on the minimum percentage identity of the aligned protein
sequence used in calculating them. Every possible identity or substitution is assigned a score based on its
observed frequencies in the alignment of related proteins. A positive score is given to the more likely
substitutions while a negative score is given to the less likely substitutions.

To calculate a matrix for BLOSUM, the following equation is used:

$$S_{ij} = \frac{1}{\lambda} \log\left(\frac{p_{ij}}{q_i q_j}\right)$$

Here, $p_{ij}$ is the probability of two amino acids i and j replacing each other in a homologous sequence, and
$q_i$ and $q_j$ are the background probabilities of finding the amino acids i and j in any protein sequence at random.
The factor $\lambda$ is a scaling factor, set such that the matrix contains easily computable integer values.
External links

■ Page on BLOSUM

■ Sean R. Eddy (2004). "Where did the BLOSUM62 alignment score matrix come from?". Nature
Biotechnology 22: 1035. doi:10.1038/nbt0804-1035. PMID 15286655.
http://informatics.umdnj.edu/bioinformatics/courses/5020/notes/BLOSUM62%20primer.pdf.

■ BLOCKS WWW server

■ Scoring systems for BLAST at NCBI

■ Data files of BLOSUM on the NCBI FTP server.

See also

■ Sequence alignment

■ Point accepted mutation
EXHIBIT B



Glycosylation 



From Wikipedia, the free encyclopedia

Glycosylation is the enzymatic process that links saccharides to produce glycans, either free or attached to 
proteins and lipids. This enzymatic process produces one of four fundamental components of all cells (along 
with nucleic acids, proteins, and lipids) and also provides a co-translational and post-translational modification 
mechanism that modulates the structure and function of membrane and secreted proteins. The majority of 
proteins synthesized in the rough ER undergo glycosylation. It is an enzyme-directed site-specific process, as 
opposed to the non-enzymatic chemical reaction of glycation. Glycosylation is also present in the cytoplasm 
and nucleus as the O-GlcNAc modification. Six classes of glycans are produced: N-linked glycans attached to
the amide nitrogen of asparagine side chains; O-linked glycans attached to the hydroxy oxygen of serine and
threonine side chains; glycosaminoglycans attached to the hydroxy oxygen of serine; glycolipids in which the
glycans are attached to ceramide; hyaluronan, which is unattached to either protein or lipid; and GPI anchors,
which link proteins to lipids through glycan linkages.
Purpose

The polysaccharide chains attached to the target proteins serve various functions. For instance, some proteins 
do not fold correctly unless they are glycosylated first. Also, polysaccharides linked at the amide nitrogen of
asparagine in the protein confer stability on some secreted glycoproteins. Experiments have shown that 
glycosylation in this case is not a strict requirement for proper folding, but the unglycosylated protein degrades 
quickly. Glycosylation may play a role in cell-cell adhesion (a mechanism employed by cells of the immune 
system), as well 

Mechanisms 

There are various mechanisms for glycosylation, although all share several common features:

■ Glycosylation is an enzymatic process;
■ The donor molecule is an activated nucleotide sugar;
■ The process is site-specific.

N-linked glycosylation 

N-linked glycosylation is important for the folding of some eukaryotic proteins. The N-linked 
glycosylation process occurs in eukaryotes and widely in archaea, but very rarely in bacteria. 



For N-linked oligosaccharides, a 14-sugar precursor is first added to the asparagine in the polypeptide 
chain of the target protein. The structure of this precursor is common to most eukaryotes, and 
contains 3 glucose, 9 mannose, and 2 N-acetylglucosamine molecules. A complex set of 
reactions attaches this branched chain to a carrier molecule called dolichol, and then it is transferred 
to the appropriate point on the polypeptide chain as it is translocated into the ER lumen. 




There are three major types of N-linked saccharides: high-mannose oligosaccharides, 
complex oligosaccharides and hybrid oligosaccharides. 



Figure: Comparative overview of the major types of vertebrate N-glycan subtypes and some representative C. elegans N-glycans. 



■ High-mannose is, in essence, just two N-acetylglucosamines with many mannose residues, often almost 
as many as are seen in the precursor oligosaccharides before it is attached to the protein. 

■ Complex oligosaccharides are so named because they can contain almost any number of the other types 
of saccharides, including more than the original two N-acetylglucosamines. 

Proteins can be glycosylated by both types of oligosaccharide on different portions of the protein. Whether an 
oligosaccharide is high-mannose or complex is thought to depend on its accessibility to saccharide-modifying 
proteins in the Golgi. If the saccharide is relatively inaccessible, it will most likely stay in its original 
high-mannose form. If it is accessible, then it is likely that many of the mannose residues will be cleaved off and the 
saccharide will be further modified by the addition of other types of group as discussed above. 

The oligosaccharide chain is attached by oligosaccharyltransferase to asparagine occurring in the tripeptide 
sequence Asn-X-Ser or Asn-X-Thr, where X can be any amino acid except Pro. This sequence is known as a 
glycosylation sequon. After attachment, once the protein is correctly folded, the three glucose residues are 
removed from the chain and the protein is available for export from the ER. The glycoprotein thus formed is 
then transported to the Golgi, where removal of further mannose residues may take place. However, 
glycosylation itself does not seem to be as necessary for correct transport targeting of the protein as one might 
think. In studies involving drugs that block certain steps in glycosylation, or mutant cells deficient in a 
glycosylation enzyme, cells still produce otherwise structurally normal proteins that are correctly targeted, and this 
interference does not seem to severely affect the viability of the cells. Mature glycoproteins may 
contain a variety of oligomannose N-linked oligosaccharides containing between 5 and 9 mannose residues. 
Further removal of mannose residues leads to a 'core' structure containing 3 mannose and 2 
N-acetylglucosamine residues, which may then be elongated with a variety of different monosaccharides 
including galactose, N-acetylglucosamine, N-acetylgalactosamine, fucose and sialic acid. 
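
The sequon rule described above (Asn-X-Ser or Asn-X-Thr, with X being any amino acid except Pro) can be illustrated with a short pattern search. The following is a minimal sketch in Python, assuming the protein is given as a string of one-letter amino acid codes; the function name and the demo peptide are hypothetical and are not part of the original article. A match marks only a candidate site: whether a given sequon is actually glycosylated depends on factors this sketch does not model.

```python
import re

# N = Asn, [^P] = any residue except Pro, [ST] = Ser or Thr.
# The zero-width lookahead lets overlapping sequons (e.g. "NNSS") each be found.
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_sequons(sequence: str) -> list[int]:
    """Return 1-based positions of Asn residues that start an N-X-S/T sequon."""
    return [match.start() + 1 for match in SEQUON.finditer(sequence.upper())]


if __name__ == "__main__":
    demo_peptide = "MKNVSAANPTQNNSS"  # hypothetical sequence, for illustration only
    for position in find_sequons(demo_peptide):
        print(f"candidate N-glycosylation site at Asn {position}")
```

In the demo peptide, the Asn followed by Pro (N-P-T) is skipped, which is the exclusion stated in the text.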

O-linked glycosylation 
O-N-acetylgalactosamine (O-GalNAc) 

O-linked glycosylation occurs at a later stage during protein processing, probably in the Golgi apparatus. This 
is the addition of N-acetyl-galactosamine to serine or threonine residues by the enzyme 
UDP-N-acetyl-D-galactosamine:polypeptide N-acetylgalactosaminyltransferase (EC 2.4.1.41), followed by other carbohydrates 
(such as galactose and sialic acid). This process is important for certain types of proteins such as proteoglycans, 
where it involves the addition of glycosaminoglycan chains to an initially unglycosylated "proteoglycan core 
protein." These additions are usually serine O-linked glycoproteins, which seem to have one of two main 
functions. One function involves secretion to form components of the extracellular matrix, adhering one cell to 
another by interactions between the large sugar complexes of proteoglycans. The other main function is to act 
as a component of mucosal secretions, and it is the high concentration of carbohydrates that tends to give 
mucus its "slimy" feel. Proteins that circulate in the blood are not normally O-glycosylated, with the exception 
of IgA1 and IgD (two types of antibody) and C1-inhibitor. 

O-fucose 

O-fucose is added between the second and third conserved cysteines of EGF-like repeats in the Notch protein, 
and other substrates, by GDP-fucose protein O-fucosyltransferase 1, and to Thrombospondin repeats by 
GDP-fucose protein O-fucosyltransferase 2. In the case of EGF-like repeats, the O-fucose may be further elongated 
to a tetrasaccharide by sequential addition of N-acetylglucosamine (GlcNAc), galactose, and sialic acid, and for 
Thrombospondin repeats, may be elongated to a disaccharide by the addition of glucose. Both of these 
fucosyltransferases have been localized to the endoplasmic reticulum, which is unusual for 
glycosyltransferases, most of which function in the Golgi apparatus. 

O-glucose 

O-glucose is added between the first and second conserved cysteines of EGF-like repeats in the Notch protein, 
and possibly other substrates by an unidentified O-glucosyltransferase. 

O-N-acetylglucosamine (O-GlcNAc) 

O-GlcNAc is added to serines or threonines by O-GlcNAc transferase. O-GlcNAc appears to occur on serines 
and threonines that would otherwise be phosphorylated by serine/threonine kinases. Thus, if phosphorylation 
occurs, O-GlcNAc does not, and vice versa. This is an incredibly important finding because 
phosphorylation/dephosphorylation has become a scientific paradigm for the regulation of signaling within 
cells. A massive amount of cancer research is focused on phosphorylation. Ignoring the involvement of this 
form of glycosylation, which clearly appears to act in concert with phosphorylation, means that a lot of current 
research is missing at least half of the picture. O-GlcNAc addition and removal also appear to be key regulators 
of the pathways that are deregulated in diabetes mellitus. The gene encoding the O-GlcNAc removal enzyme 
has been linked to non-insulin dependent diabetes mellitus. It is the terminal step in a nutrient-sensing 
hexosamine signaling pathway. 

GPI anchor 

A special form of glycosylation is the GPI anchor. This form of glycosylation functions to attach a protein to a 
hydrophobic lipid anchor, via a glycan chain (see also prenylation). 

C-mannosylation 

A mannose sugar is added to tryptophan residues in Thrombospondin repeats. This is an unusual modification 
both because the sugar is linked to a carbon rather than a reactive atom like a nitrogen or oxygen and because 
the sugar is linked to a tryptophan residue rather than an asparagine or serine/threonine. 

See also 

■ Glycation 
■ Advanced glycation endproduct 
■ Chemical glycosylation 

External links 

■ Online textbook of glycobiology with chapters about glycosylation 

■ GlyProt: In silico N-glycosylation of proteins on the web 

■ NetNGlyc: The NetNGlyc server predicts N-glycosylation sites in human proteins using artificial neural 
networks that examine the sequence context of Asn-Xaa-Ser/Thr sequons. 

Retrieved from "http://en.wikipedia.org/wiki/Glycosylation" 

Categories: Posttranslational modification | Organic chemistry | Carbohydrates | Carbohydrate chemistry 
Hidden categories: Articles lacking sources from December 2007 | All articles lacking sources 

■ This page was last modified on 24 April 2009, at 13:41 (UTC). 

■ All text is available under the terms of the GNU Free Documentation License. (See Copyrights for 
details.) 

Wikipedia® is a registered trademark of the Wikimedia Foundation, Inc., a U.S. registered 501(c)(3) 
tax-deductible nonprofit charity. 



EXHIBIT C 



PEGylation 



From Wikipedia, the free encyclopedia 



PEGylation is the process of covalent attachment of poly(ethylene glycol) 
polymer chains to another molecule, normally a drug or therapeutic protein. 
PEGylation is routinely achieved by incubation of a reactive derivative of 
PEG with the target macromolecule. The covalent attachment of PEG to a 
drug or therapeutic protein can "mask" the agent from the host's immune 
system (reduced immunogenicity and antigenicity) and increase the 
hydrodynamic size (size in solution) of the agent, which prolongs its circulatory time by reducing renal 
clearance. PEGylation can also provide water solubility to hydrophobic drugs and proteins. 



Polyethylene glycol 
History 

In the 1970s, pioneering research by Dr. Frank Davis, Dr. Abraham Abuchowski and colleagues foresaw the 
potential of the conjugation of polyethylene glycol (PEG) to proteins. Dr. Abuchowski founded Enzon, Inc., 
which brought three PEGylated drugs to market, and is the founder and president of Prolong Pharmaceuticals. 

PEGylation is a process of attaching strands of the polymer PEG to molecules, most typically peptides, 
proteins, and antibody fragments, that can help to meet the challenges of improving the safety and efficiency of 
many therapeutics. It produces alterations in physicochemical properties, including changes in conformation, 
electrostatic binding, and hydrophobicity. These physical and chemical changes increase systemic retention of 
the therapeutic agent. It can also influence the binding affinity of the therapeutic moiety to the cell receptors 
and can alter the absorption and distribution patterns. 

PEGylation, by increasing the molecular weight of a molecule, can impart several significant pharmacological 
advantages over the unmodified form, such as: 



• Improved drug solubility 

• Reduced dosage frequency, without diminished efficacy and with potentially reduced toxicity 



• Extended circulating life 

• Increased drug stability 



• Enhanced protection from proteolytic degradation 
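
The molecular-weight increase mentioned above can be estimated with simple arithmetic. The following is a rough sketch in Python, assuming unmodified linear PEG of formula HO-(CH2CH2O)n-H (repeat unit about 44.05 g/mol, end groups together about 18.02 g/mol). The function names are hypothetical, and the conjugate estimate deliberately ignores the linker chemistry and any atoms lost on coupling, so it is an approximation only.

```python
ETHYLENE_OXIDE_UNIT = 44.05  # g/mol, one -CH2CH2O- repeat unit
END_GROUPS = 18.02           # g/mol, terminal H and OH of HO-(CH2CH2O)n-H


def peg_mass(n_units: int) -> float:
    """Approximate molar mass of a linear PEG chain with n repeat units."""
    return n_units * ETHYLENE_OXIDE_UNIT + END_GROUPS


def conjugate_mass(protein_mass: float, n_chains: int, units_per_chain: int) -> float:
    """Rough mass of a PEGylated protein (linker and leaving groups ignored)."""
    return protein_mass + n_chains * peg_mass(units_per_chain)


if __name__ == "__main__":
    # Hypothetical example: a 20 kDa protein carrying one chain of 454 repeat units.
    print(f"PEG chain: {peg_mass(454):.0f} g/mol")
    print(f"conjugate: {conjugate_mass(20_000.0, 1, 454):.0f} g/mol")
```

Under these assumptions a single chain of roughly 450 repeat units about doubles the mass of a 20 kDa protein, which gives a sense of scale for the increase in size that the text links to the pharmacological advantages listed above.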

PEGylated drugs also have the following commercial advantages: 

• Opportunities for new delivery formats and dosing regimens 

• Extended patent life of previously approved drugs 



PEGylated Pharmaceuticals on the Market 



The clinical value of PEGylation is now well established. ADAGEN (PEG-bovine adenosine deaminase), 
manufactured by Enzon Pharmaceuticals, Inc., US, was the first PEGylated protein approved by the FDA, in March 
1990, to enter the market. It is used to treat severe combined immunodeficiency associated with adenosine 
deaminase deficiency, as an alternative to bone marrow transplantation and enzyme replacement by gene therapy. Since the introduction of 
ADAGEN, a large number of PEGylated protein and peptide pharmaceuticals have followed, and many others 
are in clinical trials or under development. Some of the successful examples are: 

• PEGASYS: PEGylated interferon alpha for use in the treatment of chronic hepatitis C and hepatitis B 
(Hoffmann-La Roche) 

• Pegintron: PEGylated interferon alpha for use in the treatment of chronic hepatitis C and hepatitis B 
(Schering-Plough / Enzon) 

• Oncaspar: PEGylated L-asparaginase for the treatment of acute lymphoblastic leukemia in patients who are 
hypersensitive to the native unmodified form of L-asparaginase (Enzon). This drug was recently approved for 
front line use. 

• Neulasta: PEGylated recombinant methionyl human granulocyte colony-stimulating factor for severe cancer 
chemotherapy induced neutropenia (Amgen) 

• Doxil: PEGylated liposome containing doxorubicin for the treatment of cancer (Sequus) 

PEG Moiety Properties 

PEG is a particularly attractive polymer for conjugation. The specific characteristics of PEG moieties relevant 
to pharmaceutical applications are: 

• Water solubility 

• High mobility in solution 

• Lack of toxicity and immunogenicity 

• Ready clearance from the body 

• Altered distribution in the body 

PEGylation Process 

The first step of PEGylation is the suitable functionalization of the PEG polymer at one or both terminals. 
PEGs that are activated at each terminus with the same reactive moiety are known as "homobifunctional", 
whereas if the functional groups present are different, then the PEG derivative is referred to as 
"heterobifunctional" or "heterofunctional". The chemically active or activated derivatives of the PEG polymer 
are prepared to attach the PEG to the desired molecule. 
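
The terminology above can be expressed as a small classification rule. The following is a minimal sketch in Python; the `PegDerivative` type and the `classify` function are hypothetical names introduced only for illustration. The labels "unactivated" and "monofunctional" for chains with no or one activated terminus are assumptions of this sketch and do not come from the article.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PegDerivative:
    """A linear PEG chain described by the reactive group on each terminus."""
    end_a: Optional[str]  # e.g. "maleimide"; None if the terminus is not activated
    end_b: Optional[str]


def classify(peg: PegDerivative) -> str:
    """Label a PEG derivative by comparing its two terminal groups."""
    groups = [g.strip().lower() for g in (peg.end_a, peg.end_b) if g is not None]
    if not groups:
        return "unactivated"      # assumed label, not from the article
    if len(groups) == 1:
        return "monofunctional"   # assumed label, not from the article
    if groups[0] == groups[1]:
        return "homobifunctional"
    return "heterobifunctional"


if __name__ == "__main__":
    print(classify(PegDerivative("NHS ester", "NHS ester")))
    print(classify(PegDerivative("maleimide", "NHS ester")))
```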

The choice of the suitable functional group for the PEG derivative is based on the type of available reactive 
group on the molecule that will be coupled to the PEG. For proteins, typical reactive amino acids include 
lysine, cysteine, histidine, arginine, aspartic acid, glutamic acid, serine, threonine, and tyrosine. The N-terminal 
amino group and the C-terminal carboxylic acid can also be used. 
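
As a first pass at choosing a functional group, one can tally which of the reactive residues listed above occur in a given protein. The following is a minimal sketch in Python, assuming a sequence in one-letter amino acid codes; the function name and the example sequence are hypothetical. It only counts residues and says nothing about whether a site is solvent-accessible or reactive under particular conditions.

```python
from collections import Counter

# One-letter codes for the reactive amino acids named in the text.
REACTIVE_RESIDUES = {
    "K": "lysine", "C": "cysteine", "H": "histidine", "R": "arginine",
    "D": "aspartic acid", "E": "glutamic acid", "S": "serine",
    "T": "threonine", "Y": "tyrosine",
}


def candidate_sites(sequence: str) -> dict[str, int]:
    """Count potential PEG attachment sites in a single polypeptide chain."""
    counts = Counter(sequence.upper())
    sites = {
        name: counts[code]
        for code, name in REACTIVE_RESIDUES.items()
        if counts[code] > 0
    }
    if sequence:
        # Each chain also has one free N-terminus and one free C-terminus.
        sites["N-terminal amino group"] = 1
        sites["C-terminal carboxylic acid"] = 1
    return sites


if __name__ == "__main__":
    for site, count in candidate_sites("MKKCAHSTY").items():  # hypothetical peptide
        print(f"{site}: {count}")
```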

The techniques used to form first generation PEG derivatives generally involve reacting the PEG polymer with a 
group that is reactive with hydroxyl groups, typically anhydrides, acid chlorides, chloroformates and 
carbonates. In second generation PEGylation chemistry, more efficient functional groups such as aldehydes, 
esters, and amides were made available for conjugation. 

As applications of PEGylation have become more and more advanced and sophisticated, there has been an 
increase in need for heterobifunctional PEGs for conjugation. These heterobifunctional PEGs are very useful in 
linking two entities, where a hydrophilic, flexible and biocompatible spacer is needed. Preferred end groups for 
heterobifunctional PEGs are maleimide, vinyl sulfones, pyridyl disulfide, amine, carboxylic acids and NHS 
esters. 